Fix CJKBigramFilter inconsistent positions with outputUnigrams disabled#15825
Fix CJKBigramFilter inconsistent positions with outputUnigrams disabled#15825herley-shaori wants to merge 4 commits intoapache:mainfrom
Conversation
…ams disabled (apache#15812) When outputUnigrams=false, CJKBigramFilter produced different token positions compared to outputUnigrams=true. A bigram spans two character positions but only advanced the position counter by 1. After a word break (punctuation, whitespace, or non-CJK text), subsequent tokens were assigned incorrect positions, breaking phrase queries in combined unigram+bigram indexing strategies. The fix tracks whether bigrams were emitted from the current CJK segment and defers an extra position increment (+1) to apply to the first token after a segment boundary. This ensures outputUnigrams=false behaves "as if unigrams were emitted then removed", keeping positions aligned across both settings. Example: "一二、三" Before: 一二(pos0) 三(pos1) — wrong, positions don't match After: 一二(pos0) 三(pos2) — correct, matches outputUnigrams=true
|
Looks great! I think CJKAnalyzer may use this filter, and now it's position increments will have changed. Can you glance at the failing tests? |
…mFilter behavior CJKAnalyzer uses CJKBigramFilter with outputUnigrams=false, so its tests need the same position increment updates applied to TestCJKBigramFilter and TestWithCJKBigramFilter: after a CJK bigram segment boundary, the next token now correctly gets positionIncrement=2 instead of 1. Updates testJa2, testMix, testMix2, testReusableTokenStream, and testFinalOffset.
Done! All tests have passed. |
| termAtt.setLength(len); | ||
| offsetAtt.setOffset(startOffset[index], endOffset[index]); | ||
| typeAtt.setType(SINGLE_TYPE); | ||
| if (!outputUnigrams && deferredPosInc > 0) { |
There was a problem hiding this comment.
this part of the if seems unnecessary, since deferredPosInc is never incremented unless !outputUnigrams.
| if (!outputUnigrams && deferredPosInc > 0) { | |
| if (deferredPosInc > 0) { |
There was a problem hiding this comment.
Thanks for the review! Applied your suggestion and extended the same reasoning to the other guards:
- flushUnigram(): removed !outputUnigrams && (your suggestion)
- flushBigram(): added if (deferredPosInc > 0) guard to skip the redundant setPositionIncrement(1) when
clearAttributes() already defaults to 1 - incrementToken() (both segment boundary checks): removed !outputUnigrams && before hadBigrams — since hadBigrams is only ever set true inside the !outputUnigrams branch of flushBigram(), the outer check is redundant.
Also fixed TestCJKAnalyzer (testJa2, testMix, testMix2, testReusableTokenStream, testFinalOffset) — same position increment updates needed since CJKAnalyzer uses CJKBigramFilter with outputUnigrams=false.
deferredPosInc is only ever incremented when !outputUnigrams, so the extra condition is unnecessary. Suggested by rmuir in review.
hadBigrams is only ever set true inside the !outputUnigrams branch of flushBigram(), so checking !outputUnigrams before testing hadBigrams is redundant. Same reasoning applies to the deferredPosInc guard in flushBigram() — clearAttributes() already defaults posInc to 1, so we only need to call setPositionIncrement when deferredPosInc > 0.
Summary
Fixes #15812
CJKBigramFilterproduces different token positions for the same input depending on whetheroutputUnigramsistrueorfalse. This breaks phrase queries when index-time and search-time analyzers use differentoutputUnigramssettings — a common optimization pattern for CJK search.Root cause
In
flushBigram(), whenoutputUnigrams=false, bigrams are emitted with the defaultpositionIncrement=1, but a bigram conceptually spans two character positions. After a word break (punctuation, whitespace, or non-CJK text), subsequent tokens are assigned positions that are off by 1 compared to theoutputUnigrams=truecase.Example with input
"一二、三":Fix
Following the principle suggested by @rmuir —
outputUnigrams=falseshould behave as if unigrams were emitted, then later removed — this PR tracks whether bigrams were emitted from the current CJK segment and defers an extra position increment (+1) to apply to the first token after a segment boundary.Two new fields in
CJKBigramFilter:hadBigrams: settruewhen a bigram is flushed in no-unigram modedeferredPosInc: accumulated extra position increment, applied at the next segment transition (unaligned offsets, non-CJK token, or end of stream)The deferred increment is applied in
flushBigram(),flushUnigram(), and the non-CJK passthrough path inincrementToken().Changes
CJKBigramFilter.java: Added position tracking logic across CJK segment boundariesTestCJKBigramFilter.java: Added 3 new test cases reproducing the bug; updatedtestHanOnlyexpected positionsTestWithCJKBigramFilter.java(ICU): Updated expected positions intestJa2,testMix,testMix2,testReusableTokenStream, andtestFinalOffsetCHANGES.txt: Added bug fix entryTest plan
./gradlew tidytestBigramPositionsConsistentAcrossWordBreak— reproduces exact scenario from issuetestBigramPositionsMultipleSegments— verifies across multiple CJK segments with breakstestBigramPositionsBeforeNonCJK— verifies CJK bigram followed by non-CJK text